Super-resolution Benthic Object Detection¶
In our last notebook, we evaluated MBARI's Monterey Bay Benthic Object Detector on the external, held-out TrashCAN dataset. There, we found that the model's results were good, given the slight adaptations we had to make to compare against the new annotations. However, we also saw potential for better model performance when applying some types of upscaling to the input images.
In this note, we will build a workflow to easily feed the inputs from the TrashCAN dataset through a super-resolution layer before feeding them into the MBARI model. We will then evaluate the performance of the model with and without the super-resolution layer to see how much improvement, if any, we can achieve. It will also be important to measure computation time and memory usage to see if the tradeoff is worth it.
In later notes, we can explore fine-tuning the set-up built here. This is important to keep in mind when making decisions about how to implement the super-resolution layer.
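Since computation time and memory usage are part of the tradeoff, here is a minimal sketch of how one call could be measured with the standard library. The `measure` helper and `fake_inference` function are hypothetical stand-ins, not part of this repository. Note that `tracemalloc` only tracks allocations made through Python's allocator, so memory held natively by ONNX Runtime or PyTorch would likely be under-reported.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def fake_inference(n):
    """Hypothetical stand-in for a model call."""
    return [i * i for i in range(n)]


result, seconds, peak_bytes = measure(fake_inference, 10_000)
print(f"{seconds:.4f} s, peak {peak_bytes / 1024:.1f} KiB")
```

The same wrapper could be placed around the raw and the upscaled prediction calls to compare them on equal terms.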
%load_ext autoreload
%autoreload 2
#%pip install -r ../requirements.txt
from fathomnet.models.yolov5 import YOLOv5Model
from IPython.display import display
from pathlib import Path
from PIL import Image
from pycocotools.coco import COCO
from typing import List
import json
import onnxruntime
import os
import numpy as np
root_dir = Path(os.getcwd().split("personal/")[0])
repo_dir = root_dir / "personal" / "ocean-species-identification"
Load¶
We will start by loading the TrashCAN dataset, the MBARI model, and label map between the two. Aside from path building, each requires only a single line of code to load.
data_dir = root_dir / "data" / "TrashCAN"
benthic_model_weights_path = root_dir / "personal" / "models" / "fathomnet_benthic" / "mbari-mb-benthic-33k.pt"
benthic_model = YOLOv5Model(benthic_model_weights_path)
trashcan_data = COCO(data_dir / "dataset" / "material_version" / "instances_val_trashcan.json")
benthic2trashcan_ids = json.load(open(repo_dir / "data" / "benthic2trashcan_ids.json"))
Using cache found in /Users/per.morten.halvorsen@schibsted.com/.cache/torch/hub/ultralytics_yolov5_master YOLOv5 🚀 2024-2-24 Python-3.11.5 torch-2.2.1 CPU
Fusing layers... Model summary: 476 layers, 91841704 parameters, 0 gradients Adding AutoShape...
loading annotations into memory... Done (t=0.17s) creating index... index created!
Super resolution model¶
In our first notebook, we mentioned a super-resolution model that could possibly help the model perform slightly better on our held-out dataset. The model we used here was the ESRGAN model, which is a generative adversarial network (GAN) that is trained to upscale images.
The general idea is that if we can enhance some of the abstract fine-grained patterns in the images, the larger picture may be easier for the model to interpret. Considering that YOLO architectures contain stacks of convolutional layers, this hypothesis seems reasonable. Convolutional layers consider a small window of the input image at a time, before pooling the results into a single output. If some lower-level features were to be enhanced through a super-resolution layer, the model may be able to make better use of them.
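To make the "small window" idea concrete, here is a minimal NumPy sketch of a sliding-window convolution (strictly a cross-correlation, as in most deep learning frameworks) followed by 2x2 max pooling. The helper names and the toy kernel are illustrative assumptions; this is not how YOLOv5 is implemented internally.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over a 2D `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for y in range(out_h):
        for x in range(out_w):
            # Each output value only sees a kh x kw window of the input
            window = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(window * kernel)
    return out


def max_pool2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row/column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


image = np.arange(36, dtype=np.float32).reshape(6, 6)
edge_kernel = np.array([[1, 0, -1]] * 3, dtype=np.float32)  # horizontal gradient
features = conv2d_valid(image, edge_kernel)
pooled = max_pool2x2(features)
print(features.shape, pooled.shape)
```

If the upscaler sharpens local patterns, the values inside each window change, which is where any benefit for the detector would have to come from.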
In our first note, we chose the ABPN-based model as our initial upscaler due to its lightweight architecture and ease of use. In this note, we will consider a few hand-picked models to measure any performance differences between architectures. There are a few different base components to consider when choosing a super-resolution model. Some emphasize context, others emphasize flexibility. The base architectural components we will consider here are: GANs, CNNs, and Attention. Click on the three sections below for more details on these, or check out their papers in the research/ folder.
ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks
Uses a generator-discriminator network, similar to SRGAN:
Two major changes were made to the generator here:
1. They removed batch normalization inside the dense blocks.
The reasoning behind this was that batch normalization introduces artifacts during evaluation, because the model normalizes with a mean and variance estimated from the training data. This becomes a problem in datasets where the training and test sets can vary quite a lot.
Additionally, empirical observations have shown that removing batch-normalization increases generalization and performance, while lowering computational cost.
2. They introduced Residual-in-Residual Dense Block (RRDB).
Connecting all layers through residual connections is expected to boost performance, as it allows the model to learn more complex features. Other work on these "multilevel residual networks" has shown improved performance in other tasks. Keep in mind, though, that this added complexity may also increase computation time. When optimizing models for underwater object detection, a common goal is to make the best model as small as possible, so that it can fit in the head of an ROV.
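The batch-normalization concern from point 1 can be illustrated with a toy example. The sketch below is a simplified normalization (no learned scale or shift, and made-up activation values), so it only shows the mechanism: statistics estimated on training data no longer center the activations once the test distribution shifts.

```python
from statistics import mean, pstdev


def normalize(values, mu, sigma, eps=1e-5):
    """Normalize with fixed statistics, as done at evaluation time."""
    return [(v - mu) / (sigma + eps) for v in values]


train_batch = [0.2, 0.4, 0.6, 0.8]  # made-up training activations
test_batch = [2.2, 2.4, 2.6, 2.8]   # same spread, shifted distribution

# Statistics are estimated once, from the training data only
train_mu, train_sigma = mean(train_batch), pstdev(train_batch)

train_out = normalize(train_batch, train_mu, train_sigma)
test_out = normalize(test_batch, train_mu, train_sigma)

# Training activations end up centered around zero, test activations do not
print(round(mean(train_out), 3), round(mean(test_out), 3))
```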
Some other key improvements in that architecture include:
- Relativistic discriminator: predicts the probability that a real image is relatively more realistic than a fake one, rather than the absolute probability that an image is real.
- Refined perceptual loss: applying the loss to the features before the activation functions, to preserve more information.
- Network interpolation: using an interpolation parameter to balance two models, the fine-tuned GAN and a peak signal-to-noise ratio (PSNR) oriented model. This makes it easy to balance the quality of the outputs without having to retrain the model.
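Network interpolation from the list above can be sketched as a parameter-wise blend, `(1 - alpha) * PSNR + alpha * GAN`. The dictionaries below are tiny made-up stand-ins for real model weights, and `interpolate_weights` is a hypothetical helper, not a function from the ESRGAN code base.

```python
def interpolate_weights(psnr_weights, gan_weights, alpha):
    """Blend two parameter sets: (1 - alpha) * PSNR + alpha * GAN."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if psnr_weights.keys() != gan_weights.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [
            (1 - alpha) * p + alpha * g
            for p, g in zip(psnr_weights[name], gan_weights[name])
        ]
        for name in psnr_weights
    }


# Made-up parameters for two models with identical architecture
psnr = {"conv1.weight": [0.0, 1.0], "conv1.bias": [0.5]}
gan = {"conv1.weight": [1.0, 3.0], "conv1.bias": [1.5]}

blended = interpolate_weights(psnr, gan, alpha=0.25)
print(blended)
```

With `alpha=0` the result is the PSNR-oriented model, with `alpha=1` the GAN model, and values in between trade smoothness against perceptual sharpness without any retraining.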
ABPN: Anchor-based Plain Net for Mobile Image Super-Resolution
As an INT8-quantized model, this model aims to be as small as possible, so that it can run on mobile devices. It can "restore twenty-seven 1080P images (x3 scale) per second, maintaining good perceptual quality as well." In other words, it's fast and computationally cheap.
It applies residual learning in the image space, rather than the feature space, which helps preserve the input signal.
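As a rough illustration of residual learning in image space, the sketch below builds an "anchor" by repeating each low-resolution pixel (nearest-neighbour upsampling) and adds a residual on top. The residual here is a hand-made array standing in for the network output, and the function names are assumptions for illustration, not the ABPN implementation.

```python
import numpy as np


def nearest_upsample(img, scale):
    """Repeat every pixel `scale` times along height and width (H, W, C)."""
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)


def anchor_plus_residual(low_res, residual, scale):
    """Add a predicted residual to the upsampled input (the anchor)."""
    anchor = nearest_upsample(low_res.astype(np.float32), scale)
    return np.clip(anchor + residual, 0, 255).astype(np.uint8)


low_res = np.full((2, 3, 3), 100, dtype=np.uint8)
residual = np.zeros((8, 12, 3), dtype=np.float32)  # stand-in for network output
residual[0, 0, 0] = 5.0

high_res = anchor_plus_residual(low_res, residual, scale=4)
print(high_res.shape)  # (8, 12, 3)
```

Because the anchor already carries the input signal, the network only has to learn the missing detail, and a zero residual simply returns the upsampled input.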
This is the backbone architecture of the model we will use. The model we found was a PyTorch adaptation of the original model, which was written in TensorFlow. We opted for the PyTorch model, since it was easier to use out-of-the-box.
DAN: Real Image Denoising with Feature Attention
Finally, an attention model!
We'll have to research this model more in depth later.
Steps to take:
- (single) feed image from COCO through super resolution models
- (single) feed outputs through MBARI model & show detections
- (full) wrap dataflow into a pipeline
Single image through super-resolution model¶
ABPN¶
The first model we will look into is the sr_mobile_pytorch with the Anchor-based Plain Net backbone, since it was the first model we came across.
Let's start with a single example, to learn the input and output formats.
# TODO add other variations of this model: x2, x4, x8
onnx_model_path = root_dir / "personal" / "models" / "sr_mobile_python" / "models_modelx4.ort"
# reuse some code from the previous notebook, ported to src.data
os.chdir(repo_dir)
from src.data import *
starfish_images = images_per_category("animal_starfish", trashcan_data, data_dir / "dataset" / "material_version" / "val")
# starfish_images[0]
for i in range(5):
    print(i)
    example_image = Image.open(starfish_images[i])
    print(np.array(example_image).shape)
    display(example_image)
    print("~~"*40)
0 (360, 480, 3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 (270, 480, 3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2 (270, 480, 3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3 (270, 480, 3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4 (270, 480, 3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
example_image_path = starfish_images[3]
example_image = Image.open(example_image_path)
print(np.array(example_image).shape)
display(example_image)
(270, 480, 3)
The following methods were adapted from the sr_mobile_python's inference module.
import numpy as np
import cv2
import onnxruntime
from glob import glob
import os
from tqdm.auto import tqdm
def pre_process(img: np.ndarray) -> np.ndarray:
    # H, W, C -> C, H, W
    img = np.transpose(img[:, :, 0:3], (2, 0, 1))
    # C, H, W -> 1, C, H, W
    img = np.expand_dims(img, axis=0).astype(np.float32)
    return img

def post_process(img: np.ndarray) -> np.ndarray:
    # 1, C, H, W -> C, H, W
    img = np.squeeze(img)
    # C, H, W -> H, W, C
    img = np.transpose(img, (1, 2, 0))
    return img

def save(img: np.ndarray, save_name: str) -> None:
    cv2.imwrite(save_name, img)

def inference(model_path: str, img_array: np.ndarray) -> np.ndarray:
    # unsure about the ability to train an onnx model from a Mac
    # note: a new session is created on every call
    ort_session = onnxruntime.InferenceSession(model_path)
    ort_inputs = {ort_session.get_inputs()[0].name: img_array}
    ort_outs = ort_session.run(None, ort_inputs)
    return ort_outs[0]

def upscale(image_paths, model_path):
    outputs = []
    for image_path in tqdm(image_paths):
        # cv2 reads images in BGR channel order
        img = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
        # filename = os.path.basename(image_path)
        if img.ndim == 2:
            img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
        if img.shape[2] == 4:
            alpha = img[:, :, 3]  # GRAY
            alpha = cv2.cvtColor(alpha, cv2.COLOR_GRAY2BGR)  # BGR
            alpha_output = post_process(
                inference(model_path, pre_process(alpha))
            )  # BGR
            alpha_output = cv2.cvtColor(alpha_output, cv2.COLOR_BGR2GRAY)  # GRAY
            img = img[:, :, 0:3]  # BGR
            image_output = post_process(inference(model_path, pre_process(img)))  # BGR
            output_img = cv2.cvtColor(image_output, cv2.COLOR_BGR2BGRA)  # BGRA
            output_img[:, :, 3] = alpha_output
            # save(output_img, f"{save_path}/{filename}")
        elif img.shape[2] == 3:
            image_output = post_process(inference(model_path, pre_process(img)))  # BGR
            # save(image_output, f"{save_path}/{filename}")
        # only the 3-channel output is kept, also for images with an alpha channel
        outputs += [image_output.astype('uint8')]
    return outputs
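The layout conversions above can be checked on a dummy array without loading the ONNX model. The sketch below uses hypothetical helper names (`to_nchw`, `to_hwc`, `to_uint8`) that mirror `pre_process` and `post_process`. It also clips before casting to `uint8`, which is an added suggestion: the notebook's code casts directly, and values outside 0 to 255 may not convert as intended.

```python
import numpy as np


def to_nchw(img):
    """H, W, C -> 1, C, H, W as float32 (same layout as pre_process)."""
    chw = np.transpose(img[:, :, 0:3], (2, 0, 1))
    return np.expand_dims(chw, axis=0).astype(np.float32)


def to_hwc(batch):
    """1, C, H, W -> H, W, C; squeezing only axis 0 keeps other size-1 dims."""
    return np.transpose(np.squeeze(batch, axis=0), (1, 2, 0))


def to_uint8(img):
    """Round and clip to the valid pixel range before casting."""
    return np.clip(np.rint(img), 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
dummy = rng.integers(0, 256, size=(270, 480, 3), dtype=np.uint8)

batch = to_nchw(dummy)
restored = to_uint8(to_hwc(batch))

print(batch.shape)  # (1, 3, 270, 480)
print(np.array_equal(restored, dummy))  # True
```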
example_upscaled = upscale([str(example_image_path)], onnx_model_path)[0]
print(example_upscaled.shape)
Image.fromarray(example_upscaled)
0%| | 0/1 [00:00<?, ?it/s]
(1080, 1920, 3)
# # reshow the original image for comparison
# Image.fromarray(np.array(example_image))
# check the scale of the super-resolution image
x_scale = example_upscaled.shape[1] / example_image.size[0]
y_scale = example_upscaled.shape[0] / example_image.size[1]
(x_scale, y_scale)
(4.0, 4.0)
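These scale factors matter later, because boxes predicted on the upscaled image live in upscaled pixel coordinates, while the TrashCAN annotations refer to the original image. A minimal sketch of the conversion follows. `rescale_box` is a hypothetical helper, and the way `evaluate_model` applies `x_scale` and `y_scale` internally is not shown in this notebook, so treat this as an assumption about the intent.

```python
def rescale_box(box, x_scale, y_scale):
    """Map an (x1, y1, x2, y2) box from upscaled to original pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 / x_scale, y1 / y_scale, x2 / x_scale, y2 / y_scale)


upscaled_box = (400.0, 200.0, 800.0, 600.0)  # made-up detection on a x4 image
original_box = rescale_box(upscaled_box, x_scale=4.0, y_scale=4.0)
print(original_box)  # (100.0, 50.0, 200.0, 150.0)
```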
ESRGAN¶
DAN¶
Single super-resolution output through MBARI model¶
example_detections = benthic_model._model(example_image)
upscaled_detections = benthic_model._model(example_upscaled)
example_detections.show()
upscaled_detections.show()
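Here we see what we are trying to achieve with this super-resolution layer.
TODO: Look into why the color is off. The hue seems to be a bit redder in the upscaled version. One likely cause is channel order: cv2.imread returns arrays in BGR order, while Image.fromarray interprets a 3-channel array as RGB.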
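One possible explanation for the color shift is a blue/red channel swap, since OpenCV loads images as BGR while PIL treats 3-channel arrays as RGB. The sketch below shows the conversion on a single made-up pixel. Whether this fully explains the hue difference seen here has not been verified, and `bgr_to_rgb` is a hypothetical helper.

```python
import numpy as np


def bgr_to_rgb(img):
    """Reverse the channel axis of an H, W, 3 array (BGR <-> RGB)."""
    return np.ascontiguousarray(img[..., ::-1])


# A bluish pixel as OpenCV would store it: B=200, G=120, R=30
bgr_pixel = np.array([[[200, 120, 30]]], dtype=np.uint8)
rgb_pixel = bgr_to_rgb(bgr_pixel)
print(rgb_pixel[0, 0].tolist())  # [30, 120, 200]
```

Without the conversion, the blue value of 200 would be read as red, which would match water looking redder than in the original. In `upscale`, such a conversion would apply to `image_output` before it is appended to `outputs`.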
The first few examples had little to no improvement, so we went with the fourth image (index 3) to show when such a pipeline might be useful. This is a form of cherry-picking our results, but mainly for visualization purposes. The final evaluation will compare the methods fairly, without any hand-selection of input data.
Build prediction pipeline¶
onnx_model_path
PosixPath('/Users/per.morten.halvorsen@schibsted.com/personal/models/sr_mobile_python/models_modelx4.ort')
from fathomnet.models.yolov5 import YOLOv5Model
class YOLOv5ModelWithUpscale(YOLOv5Model):
    def __init__(self, detection_model_path: str, upscale_model_path: str = None):
        super().__init__(detection_model_path)
        self.upscale_model_path = upscale_model_path

    def forward(self, X: List[str]):
        if self.upscale_model_path:
            X = upscale(X, self.upscale_model_path)
        return self._model(X)
upscale_model = YOLOv5ModelWithUpscale(benthic_model_weights_path, onnx_model_path)
upscaled_detections = upscale_model.forward([str(example_image_path)]) # upscale expects a list of image paths
upscaled_detections.show()
Using cache found in /Users/per.morten.halvorsen@schibsted.com/.cache/torch/hub/ultralytics_yolov5_master YOLOv5 🚀 2024-2-24 Python-3.11.5 torch-2.2.1 CPU Fusing layers... Model summary: 476 layers, 91841704 parameters, 0 gradients Adding AutoShape...
0%| | 0/1 [00:00<?, ?it/s]
I'll add a somewhat hacky fix here, to make sure our call methods between the two models are the same. This will help standardize our evaluation setup later on.
def forward(self, X: List[str]):
    return self._model(X)
benthic_model.forward = forward.__get__(benthic_model)
example_detections = benthic_model.forward([str(example_image_path)])
example_detections.show()
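For readers unfamiliar with the `__get__` trick used above: a plain function is a descriptor, and calling `__get__` with an instance returns a method bound to that instance only. The toy `Detector` class below is a hypothetical stand-in for the model object.

```python
import types


class Detector:
    """Hypothetical stand-in for a model object without a forward method."""

    def __init__(self, name):
        self.name = name


def forward(self, X):
    return f"{self.name} got {len(X)} inputs"


detector = Detector("benthic")
other = Detector("other")

# Bind the function to one instance; the class itself is left untouched
detector.forward = forward.__get__(detector)

# types.MethodType gives an equivalent bound method
bound = types.MethodType(forward, other)

print(detector.forward(["a.jpg", "b.jpg"]))  # benthic got 2 inputs
print(bound(["a.jpg"]))  # other got 1 inputs
```

Because only the instance is patched, other instances of the class do not gain a `forward` attribute.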
Full category classifications¶
As a sanity check, let us see if we can produce predictions for a batch of images. Here, we'll reuse the first N starfish images loaded earlier.
N = 5
raw_starfish_detections = benthic_model.forward(starfish_images[:N])
upscaled_starfish_detections = upscale_model.forward(starfish_images[:N])
raw_starfish_detections.show()
upscaled_starfish_detections.show()
0%| | 0/5 [00:00<?, ?it/s]
Great! Now we can easily feed the TrashCAN dataset through the super-resolution model and then through the MBARI model. Let's get the evaluation methods developed in the last notebook and use them to compare our models.
Evaluation¶
Our evaluation will contain three main steps:
- Import the methods from our previous notebook
- Evaluate both the benthic_model and the upscale_model
- Compare the results of the two models
We start by importing the methods from the previous notebook. These methods were ported to stand-alone code, for cleaner imports.
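As a reminder of what the reported metrics mean, here is a minimal sketch of IoU, precision, and recall. The imported `evaluate_model` is not shown in this notebook, so these helpers are generic textbook definitions and may differ from the actual implementation in `src.evaluation`.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero for non-overlapping boxes
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(precision_recall(tp=3, fp=1, fn=9))  # (0.75, 0.25)
```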
from src.evaluation import *
# rebuild some needed params locally
trashcan_ids = {
row["supercategory"]: id
for id, row in trashcan_data.cats.items()
}
# find trash index
trash_idx = list(benthic_model._model.names.values()).index("trash")
print(benthic_model._model.names[trash_idx])
# find trash labels
trashcan_trash_labels = {
id: name
for name, id in trashcan_ids.items()
if name.startswith("trash")
}
trashcan_trash_labels
trash
{9: 'trash_etc',
10: 'trash_fabric',
11: 'trash_fishing_gear',
12: 'trash_metal',
13: 'trash_paper',
14: 'trash_plastic',
15: 'trash_rubber',
16: 'trash_wood'}
# replace str keys with ints
benthic2trashcan_ids = {
int(key): value
for key, value in benthic2trashcan_ids.items()
}
Run evaluation on both models¶
raw_starfish_metrics = evaluate_model(
category="animal_starfish",
data=trashcan_data,
model=benthic_model,
id_map=benthic2trashcan_ids,
# verbose=2,
# N=5,
one_idx=trash_idx,
many_idx=trashcan_trash_labels,
exclude_ids=[trashcan_ids["rov"], trashcan_ids["plant"]],
path_prefix=data_dir / "dataset" / "material_version" / "val"
)
raw_starfish_metrics
Precision: 0.39534882801514354 Recall: 0.08415841542495835 Average IoU: tensor(0.31285)
{'precision': 0.39534882801514354,
'recall': 0.08415841542495835,
'iou': tensor(0.31285),
'time': 41.07221722602844}
upscale_starfish_metrics = evaluate_model(
category="animal_starfish",
data=trashcan_data,
model=upscale_model,
id_map=benthic2trashcan_ids,
# verbose=2,
# N=5,
one_idx=trash_idx,
many_idx=trashcan_trash_labels,
exclude_ids=[trashcan_ids["rov"], trashcan_ids["plant"]],
path_prefix=data_dir / "dataset" / "material_version" / "val",
x_scale=x_scale,
y_scale=y_scale
)
upscale_starfish_metrics
0%| | 0/46 [00:00<?, ?it/s]
Precision: 0.20312499682617194 Recall: 0.06435643532496814 Average IoU: tensor(0.15016)
{'precision': 0.20312499682617194,
'recall': 0.06435643532496814,
'iou': tensor(0.15016),
'time': 45.43256592750549}
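To put the two starfish runs side by side, the relative change per metric can be computed directly from the dictionaries printed above. The numbers below are copied from those outputs, and `relative_change` is a hypothetical helper added for illustration.

```python
# Values copied from the two starfish evaluation outputs above
raw = {
    "precision": 0.39534882801514354,
    "recall": 0.08415841542495835,
    "time": 41.07221722602844,
}
upscaled = {
    "precision": 0.20312499682617194,
    "recall": 0.06435643532496814,
    "time": 45.43256592750549,
}


def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


changes = {key: relative_change(raw[key], upscaled[key]) for key in raw}
for key, value in changes.items():
    print(f"{key}: {value:+.1%}")
```

For this category, precision and recall both drop with the x4 upscaler while the run time increases, so the upscaled pipeline is not an improvement on starfish as configured here.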
Metrics for all categories¶
def evaluate_both_models(category, N=-1, verbose=False):
    raw_metrics = evaluate_model(
        category=category,
        data=trashcan_data,
        model=benthic_model,
        id_map=benthic2trashcan_ids,
        verbose=verbose,
        N=N,
        one_idx=trash_idx,
        many_idx=trashcan_trash_labels,
        exclude_ids=[trashcan_ids["rov"], trashcan_ids["plant"]],
        path_prefix=data_dir / "dataset" / "material_version" / "val"
    )
    upscale_metrics = evaluate_model(
        category=category,
        data=trashcan_data,
        model=upscale_model,
        id_map=benthic2trashcan_ids,
        verbose=verbose,
        N=N,
        one_idx=trash_idx,
        many_idx=trashcan_trash_labels,
        exclude_ids=[trashcan_ids["rov"], trashcan_ids["plant"]],
        path_prefix=data_dir / "dataset" / "material_version" / "val",
        x_scale=x_scale,
        y_scale=y_scale
    )
    return raw_metrics, upscale_metrics
raw_fish_metrics, upscale_fish_metrics = evaluate_both_models("animal_fish")
print(raw_fish_metrics)
print(upscale_fish_metrics)
0%| | 0/100 [00:00<?, ?it/s]
{'precision': 0.4166666608796297, 'recall': 0.11406844063091848, 'iou': tensor(0.32055), 'time': 81.14837098121643}
{'precision': 0.30769229585798863, 'recall': 0.030418250834911596, 'iou': tensor(0.21463), 'time': 83.67938709259033}
raw_eel_metrics, upscale_eel_metrics = evaluate_both_models("animal_eel")
print(raw_eel_metrics)
print(upscale_eel_metrics)
0%| | 0/73 [00:00<?, ?it/s]
{'precision': 0.19565216965973545, 'recall': 0.05142857113469388, 'iou': tensor(0.14774), 'time': 48.71544289588928}
{'precision': 0.0, 'recall': 0.0, 'iou': 0.0, 'time': 53.536354064941406}
raw_crab_metrics, upscale_crab_metrics = evaluate_both_models("animal_crab")
print(raw_crab_metrics)
print(upscale_crab_metrics)
0%| | 0/39 [00:00<?, ?it/s]
{'precision': 0.07692307573964499, 'recall': 0.03246753225670434, 'iou': tensor(0.06751), 'time': 36.593260049819946}
{'precision': 0.006535947669699688, 'recall': 0.006493506451340867, 'iou': tensor(0.00869), 'time': 38.949223041534424}
raw_trash_metrics, upscale_trash_metrics = evaluate_both_models("trash_plastic")
print(raw_trash_metrics)
print(upscale_trash_metrics)
0%| | 0/340 [00:00<?, ?it/s]
Conclusion¶
Wrap things up and make a plan for next steps.
Ideas:
- Fine-tuning
  - add an extra final output layer to the MBARI model, mapping 691 outputs to 17 TrashCAN labels
  - select images from FathomNet that have annotations for the TrashCAN labels
  - fine-tune the model on the FathomNet images
  - evaluate on the TrashCAN dataset
- Deeper comparative analysis
  - compare the performance of the super-resolution models on the FathomNet dataset
  - manually inspect annotations from TrashCAN and FathomNet to empirically evaluate their quality